Udacity Data Analyst Nanodegree - Project 5: Pisa2012 Results Analysis -

Exploratory and Explanatory Data Visualisation, by Dyke Ferber


I. Import libraries


Import the required libraries. We will use pandas, seaborn and matplotlib for data visualisations, numpy for some statistical computing, zipfile to extract the initial .zip file downloaded from Udacity and plot inline with the magic % command.

In [1]:
import numpy as np
import pprint
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
import os

%matplotlib inline
%config InlineBackend.figure_format = "retina"
In [2]:
# copied from https://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

II. Import the dataset


Initially, extract the zipfile, then import the dataframe as a whole which is quite computationally expensive as well as the explanations dictionary. The entire .zip-file is several gigabites in size (2.3 GB), so uploading to github is not possible. However the dataset can be obtained via https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip.

In [3]:
with zipfile.ZipFile(os.path.join(os.getcwd(), "pisa2012.csv.zip")) as zipped:
    zipped.extractall()
In [4]:
pisa_df = pd.read_csv("pisa2012.csv", sep=",", encoding='latin-1')
pisa_dict = pd.read_csv("pisadict2012.csv", sep=",", encoding='latin-1')
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (15,16,17,21,22,23,24,25,26,30,31,36,37,45,65,123,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,475) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

III. Initial Exploration


Explore the dataset programmatically to identify any data tidiness or quality issues that we need to adress before starting exploratory data analysis. Also reduce the dataset size to only keep samples or variables that are of interest in the further exploration. For data wrangling, though computationally intense, create copies of both datasets.

In [5]:
pisa = pisa_df.copy()
pdict = pisa_dict.copy()
In [6]:
pisa.head()
Out[6]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU
0 1 Albania 80000 ALB0006 Non-OECD Albania 1 1 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
1 2 Albania 80000 ALB0006 Non-OECD Albania 1 2 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
2 3 Albania 80000 ALB0006 Non-OECD Albania 1 3 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
3 4 Albania 80000 ALB0006 Non-OECD Albania 1 4 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
4 5 Albania 80000 ALB0006 Non-OECD Albania 1 5 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13

5 rows × 636 columns

In [7]:
pisa.describe()
Out[7]:
Unnamed: 0 SUBNATIO SCHOOLID STIDSTD ST01Q01 ST02Q01 ST03Q01 ST03Q02 ST06Q01 ST115Q01 ... W_FSTR74 W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU
count 485490.000000 4.854900e+05 485490.000000 485490.000000 485490.000000 485438.000000 485490.000000 485490.000000 457994.000000 479269.000000 ... 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000
mean 242745.500000 4.315457e+06 240.152197 6134.066201 9.813323 2.579260 6.558512 1996.070061 6.148963 1.265356 ... 50.844201 51.020378 50.943149 50.685275 51.019842 50.540724 50.721164 40.013920 1.531189 0.140054
std 140149.035432 2.524434e+06 278.563016 6733.144944 3.734726 2.694013 3.705244 0.255250 0.970693 0.578992 ... 120.684726 122.946533 121.170883 119.267686 122.981541 119.479516 119.799018 22.951264 0.539759 0.137864
min 1.000000 8.000000e+04 1.000000 1.000000 7.000000 1.000000 1.000000 1996.000000 4.000000 1.000000 ... 0.292900 0.292900 0.292900 0.292900 0.292900 0.292900 0.292900 1.000000 1.000000 0.000500
25% 121373.250000 2.030000e+06 61.000000 1811.000000 9.000000 1.000000 4.000000 1996.000000 6.000000 1.000000 ... 4.660300 4.664800 4.643100 4.667000 4.675200 4.651850 4.660300 20.000000 1.000000 0.037800
50% 242745.500000 4.100000e+06 136.000000 3740.000000 10.000000 1.000000 7.000000 1996.000000 6.000000 1.000000 ... 13.637700 13.698900 13.611700 13.672100 13.731100 13.582000 13.600200 40.000000 2.000000 0.145200
75% 364117.750000 6.880000e+06 291.000000 7456.000000 10.000000 3.000000 9.000000 1996.000000 7.000000 1.000000 ... 41.233500 41.512500 41.695200 41.097300 41.189600 41.290925 41.356000 60.000000 2.000000 0.199900
max 485490.000000 8.580000e+06 1471.000000 33806.000000 96.000000 25.000000 99.000000 1997.000000 16.000000 4.000000 ... 2476.566800 4155.283000 3743.450100 3232.163700 3904.868100 3607.478300 3412.174100 80.000000 3.000000 5.095500

8 rows × 268 columns

In [8]:
pisa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Columns: 636 entries, Unnamed: 0 to VER_STU
dtypes: float64(250), int64(18), object(368)
memory usage: 2.3+ GB
In [9]:
pisa.shape
Out[9]:
(485490, 636)

As handling the dictionary-like style of the pdictdataframe is quite, create a dictionary pisadictfor rapid lookups between column abbreviations and their respective meaning. This should increase comfort in cleaning the dataset.

In [10]:
pdict.rename(columns={"Unnamed: 0":"key",
             "x":"value"}, inplace=True)

pisadict = dict(zip(pdict.key, pdict.value))
In [11]:
pdict.head(10)
Out[11]:
key value
0 CNT Country code 3-character
1 SUBNATIO Adjudicated sub-region code 7-digit code (3-di...
2 STRATUM Stratum ID 7-character (cnt + region ID + orig...
3 OECD OECD country
4 NC National Centre 6-digit Code
5 SCHOOLID School ID 7-digit (region ID + stratum ID + 3-...
6 STIDSTD Student ID
7 ST01Q01 International Grade
8 ST02Q01 National Study Programme
9 ST03Q01 Birth - Month
In [12]:
len(pisadict)
Out[12]:
635

The entire dataframe contains 635 different columns, almost all of which are not in our main focus for this task. Therefore I manually selected column IDs as the index of the pdict dataframe using numpys .r_[ ] method to create a slice of column indices to retrieve from the pdict keys.

In [13]:
relevant_keys = np.r_[0,6,7,9:12, 17:20, 26, 31, 32, 37, 43:57,60:66, 140:145, 460, 466, 476]
relevant_keys
Out[13]:
array([  0,   6,   7,   9,  10,  11,  17,  18,  19,  26,  31,  32,  37,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  60,  61,  62,  63,  64,  65, 140, 141, 142, 143, 144, 460,
       466, 476])
In [14]:
columns = pdict.iloc[relevant_keys]["key"].to_list()

Also the relevant columns containing students math, science and reading score are in the dataset. All of them contain the regex PV(possible value) and can be extracted by some simple logic.

In [15]:
for key, value in pisadict.items():
    if "PV" in key and "subscale" not in value:
        columns.append(key)
In [16]:
print(columns)
['CNT', 'STIDSTD', 'ST01Q01', 'ST03Q01', 'ST03Q02', 'ST04Q01', 'ST08Q01', 'ST09Q01', 'ST115Q01', 'ST13Q01', 'ST15Q01', 'ST17Q01', 'ST19Q01', 'ST26Q01', 'ST26Q02', 'ST26Q03', 'ST26Q04', 'ST26Q05', 'ST26Q06', 'ST26Q07', 'ST26Q08', 'ST26Q09', 'ST26Q10', 'ST26Q11', 'ST26Q12', 'ST26Q13', 'ST26Q14', 'ST27Q01', 'ST27Q02', 'ST27Q03', 'ST27Q04', 'ST27Q05', 'ST28Q01', 'ST57Q01', 'ST57Q02', 'ST57Q03', 'ST57Q04', 'ST57Q05', 'LMINS', 'MMINS', 'SMINS', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']
In [17]:
pisa = pisa.loc[:,columns]
In [18]:
pisa.shape
Out[18]:
(485490, 56)
In [19]:
pisa.head(3)
Out[19]:
CNT STIDSTD ST01Q01 ST03Q01 ST03Q02 ST04Q01 ST08Q01 ST09Q01 ST115Q01 ST13Q01 ... PV1READ PV2READ PV3READ PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE
0 Albania 1 10 2 1996 Female None None 1.0 <ISCED level 3A> ... 249.5762 254.3420 406.8496 175.7053 218.5981 341.7009 408.8400 348.2283 367.8105 392.9877
1 Albania 2 10 2 1996 Female One or two times None 1.0 <ISCED level 3A> ... 406.2936 349.8975 400.7334 369.7553 396.7618 548.9929 471.5964 471.5964 443.6218 454.8116
2 Albania 3 9 9 1996 Female None None 1.0 <ISCED level 3B, 3C> ... 401.2100 404.3872 387.7067 431.3938 401.2100 499.6643 428.7952 492.2044 512.7191 499.6643

3 rows × 56 columns

Check for NaN values in our dataset, that now contains only 56 columns (variables). Also check out the corresponding datatypes.

In [20]:
[print("Col: ", col, "NA: ", pisa[col].isna().sum(), "/////", "Dtype:", pisa[col].dtype) for col in columns];
Col:  CNT NA:  0 ///// Dtype: object
Col:  STIDSTD NA:  0 ///// Dtype: int64
Col:  ST01Q01 NA:  0 ///// Dtype: int64
Col:  ST03Q01 NA:  0 ///// Dtype: int64
Col:  ST03Q02 NA:  0 ///// Dtype: int64
Col:  ST04Q01 NA:  0 ///// Dtype: object
Col:  ST08Q01 NA:  6347 ///// Dtype: object
Col:  ST09Q01 NA:  6359 ///// Dtype: object
Col:  ST115Q01 NA:  6221 ///// Dtype: float64
Col:  ST13Q01 NA:  27511 ///// Dtype: object
Col:  ST15Q01 NA:  17739 ///// Dtype: object
Col:  ST17Q01 NA:  42229 ///// Dtype: object
Col:  ST19Q01 NA:  34080 ///// Dtype: object
Col:  ST26Q01 NA:  12411 ///// Dtype: object
Col:  ST26Q02 NA:  15797 ///// Dtype: object
Col:  ST26Q03 NA:  13470 ///// Dtype: object
Col:  ST26Q04 NA:  11613 ///// Dtype: object
Col:  ST26Q05 NA:  22312 ///// Dtype: object
Col:  ST26Q06 NA:  12308 ///// Dtype: object
Col:  ST26Q07 NA:  19630 ///// Dtype: object
Col:  ST26Q08 NA:  18396 ///// Dtype: object
Col:  ST26Q09 NA:  18241 ///// Dtype: object
Col:  ST26Q10 NA:  14248 ///// Dtype: object
Col:  ST26Q11 NA:  21924 ///// Dtype: object
Col:  ST26Q12 NA:  11451 ///// Dtype: object
Col:  ST26Q13 NA:  16375 ///// Dtype: object
Col:  ST26Q14 NA:  11414 ///// Dtype: object
Col:  ST27Q01 NA:  8411 ///// Dtype: object
Col:  ST27Q02 NA:  8942 ///// Dtype: object
Col:  ST27Q03 NA:  12031 ///// Dtype: object
Col:  ST27Q04 NA:  12991 ///// Dtype: object
Col:  ST27Q05 NA:  15847 ///// Dtype: object
Col:  ST28Q01 NA:  11725 ///// Dtype: object
Col:  ST57Q01 NA:  184123 ///// Dtype: float64
Col:  ST57Q02 NA:  215682 ///// Dtype: float64
Col:  ST57Q03 NA:  201677 ///// Dtype: float64
Col:  ST57Q04 NA:  205833 ///// Dtype: float64
Col:  ST57Q05 NA:  195988 ///// Dtype: float64
Col:  LMINS NA:  202624 ///// Dtype: float64
Col:  MMINS NA:  202187 ///// Dtype: float64
Col:  SMINS NA:  214576 ///// Dtype: float64
Col:  PV1MATH NA:  0 ///// Dtype: float64
Col:  PV2MATH NA:  0 ///// Dtype: float64
Col:  PV3MATH NA:  0 ///// Dtype: float64
Col:  PV4MATH NA:  0 ///// Dtype: float64
Col:  PV5MATH NA:  0 ///// Dtype: float64
Col:  PV1READ NA:  0 ///// Dtype: float64
Col:  PV2READ NA:  0 ///// Dtype: float64
Col:  PV3READ NA:  0 ///// Dtype: float64
Col:  PV4READ NA:  0 ///// Dtype: float64
Col:  PV5READ NA:  0 ///// Dtype: float64
Col:  PV1SCIE NA:  0 ///// Dtype: float64
Col:  PV2SCIE NA:  0 ///// Dtype: float64
Col:  PV3SCIE NA:  0 ///// Dtype: float64
Col:  PV4SCIE NA:  0 ///// Dtype: float64
Col:  PV5SCIE NA:  0 ///// Dtype: float64

In case we can remove NaN values from floattype columns, replace them with the .mean( ) of the column.

In [21]:
for col in columns:
    if pisa[col].dtype == "float64":
        pisa[col].fillna(pisa[col].mean())
In [22]:
[print("Col: ", col, "NA: ", pisa[col].isna().sum(), "/////", "Dtype:", pisa[col].dtype) for col in columns];
Col:  CNT NA:  0 ///// Dtype: object
Col:  STIDSTD NA:  0 ///// Dtype: int64
Col:  ST01Q01 NA:  0 ///// Dtype: int64
Col:  ST03Q01 NA:  0 ///// Dtype: int64
Col:  ST03Q02 NA:  0 ///// Dtype: int64
Col:  ST04Q01 NA:  0 ///// Dtype: object
Col:  ST08Q01 NA:  6347 ///// Dtype: object
Col:  ST09Q01 NA:  6359 ///// Dtype: object
Col:  ST115Q01 NA:  6221 ///// Dtype: float64
Col:  ST13Q01 NA:  27511 ///// Dtype: object
Col:  ST15Q01 NA:  17739 ///// Dtype: object
Col:  ST17Q01 NA:  42229 ///// Dtype: object
Col:  ST19Q01 NA:  34080 ///// Dtype: object
Col:  ST26Q01 NA:  12411 ///// Dtype: object
Col:  ST26Q02 NA:  15797 ///// Dtype: object
Col:  ST26Q03 NA:  13470 ///// Dtype: object
Col:  ST26Q04 NA:  11613 ///// Dtype: object
Col:  ST26Q05 NA:  22312 ///// Dtype: object
Col:  ST26Q06 NA:  12308 ///// Dtype: object
Col:  ST26Q07 NA:  19630 ///// Dtype: object
Col:  ST26Q08 NA:  18396 ///// Dtype: object
Col:  ST26Q09 NA:  18241 ///// Dtype: object
Col:  ST26Q10 NA:  14248 ///// Dtype: object
Col:  ST26Q11 NA:  21924 ///// Dtype: object
Col:  ST26Q12 NA:  11451 ///// Dtype: object
Col:  ST26Q13 NA:  16375 ///// Dtype: object
Col:  ST26Q14 NA:  11414 ///// Dtype: object
Col:  ST27Q01 NA:  8411 ///// Dtype: object
Col:  ST27Q02 NA:  8942 ///// Dtype: object
Col:  ST27Q03 NA:  12031 ///// Dtype: object
Col:  ST27Q04 NA:  12991 ///// Dtype: object
Col:  ST27Q05 NA:  15847 ///// Dtype: object
Col:  ST28Q01 NA:  11725 ///// Dtype: object
Col:  ST57Q01 NA:  184123 ///// Dtype: float64
Col:  ST57Q02 NA:  215682 ///// Dtype: float64
Col:  ST57Q03 NA:  201677 ///// Dtype: float64
Col:  ST57Q04 NA:  205833 ///// Dtype: float64
Col:  ST57Q05 NA:  195988 ///// Dtype: float64
Col:  LMINS NA:  202624 ///// Dtype: float64
Col:  MMINS NA:  202187 ///// Dtype: float64
Col:  SMINS NA:  214576 ///// Dtype: float64
Col:  PV1MATH NA:  0 ///// Dtype: float64
Col:  PV2MATH NA:  0 ///// Dtype: float64
Col:  PV3MATH NA:  0 ///// Dtype: float64
Col:  PV4MATH NA:  0 ///// Dtype: float64
Col:  PV5MATH NA:  0 ///// Dtype: float64
Col:  PV1READ NA:  0 ///// Dtype: float64
Col:  PV2READ NA:  0 ///// Dtype: float64
Col:  PV3READ NA:  0 ///// Dtype: float64
Col:  PV4READ NA:  0 ///// Dtype: float64
Col:  PV5READ NA:  0 ///// Dtype: float64
Col:  PV1SCIE NA:  0 ///// Dtype: float64
Col:  PV2SCIE NA:  0 ///// Dtype: float64
Col:  PV3SCIE NA:  0 ///// Dtype: float64
Col:  PV4SCIE NA:  0 ///// Dtype: float64
Col:  PV5SCIE NA:  0 ///// Dtype: float64
In [23]:
pisa.duplicated().any()
Out[23]:
False

Rename column abbreviations with their true name, then shorten it.

In [24]:
for col in pisa.columns:
    pisa.rename(columns={col:pisadict[col]}, inplace=True)
In [25]:
pisa.columns
Out[25]:
Index(['Country code 3-character', 'Student ID', 'International Grade',
       'Birth - Month', 'Birth -Year', 'Gender', 'Truancy - Late for School',
       'Truancy - Skip whole school day',
       'Truancy - Skip classes within school day', 'Mother<Highest Schooling>',
       'Mother Current Job Status', 'Father<Highest Schooling>',
       'Father Current Job Status', 'Possessions - desk',
       'Possessions - own room', 'Possessions - study place',
       'Possessions - computer', 'Possessions - software',
       'Possessions - Internet', 'Possessions - literature',
       'Possessions - poetry', 'Possessions - art', 'Possessions - textbooks',
       'Possessions - <technical reference books>', 'Possessions - dictionary',
       'Possessions - dishwasher', 'Possessions - <DVD>',
       'How many - cellular phones', 'How many - televisions',
       'How many - computers', 'How many - cars',
       'How many - rooms bath or shower', 'How many books at home',
       'Out-of-School Study Time - Homework',
       'Out-of-School Study Time - Guided Homework',
       'Out-of-School Study Time - Personal Tutor',
       'Out-of-School Study Time - Commercial Company',
       'Out-of-School Study Time - With Parent',
       'Learning time (minutes per week)  - <test language>',
       'Learning time (minutes per week)- <Mathematics>',
       'Learning time (minutes per week) - <Science>',
       'Plausible value 1 in mathematics', 'Plausible value 2 in mathematics',
       'Plausible value 3 in mathematics', 'Plausible value 4 in mathematics',
       'Plausible value 5 in mathematics', 'Plausible value 1 in reading',
       'Plausible value 2 in reading', 'Plausible value 3 in reading',
       'Plausible value 4 in reading', 'Plausible value 5 in reading',
       'Plausible value 1 in science', 'Plausible value 2 in science',
       'Plausible value 3 in science', 'Plausible value 4 in science',
       'Plausible value 5 in science'],
      dtype='object')
In [26]:
rename_dict = {'Country code 3-character': "Country_Code", 
               'Student ID': "Student_ID", 
               'International Grade': "International_Grade",
               'Birth - Month': "Birth_Month", 
               'Birth -Year': "Birth_Year", 
               'Gender':"Sex", 
               'Truancy - Late for School': "Late_For_School",
               'Truancy - Skip whole school day': "Skip_Whole_School_Day",
               'Truancy - Skip classes within school day': "Skip_Classes_Within_School",
               'Mother<Highest Schooling>':"Mother_Highest_Schooling",
               'Mother Current Job Status':"Mother_Job_Status",
               'Father<Highest Schooling>':"Father_Highest_Schooling", 
               'Father Current Job Status':"Father_Job_Status",
               'Possessions - desk':"Poss_Desk", 
               'Possessions - own room':"Poss_Own_Room",
               'Possessions - study place':"Poss_Study_Place", 
               'Possessions - computer':"Poss_Computer",
               'Possessions - software':"Poss_Software", 
               'Possessions - Internet':"Poss_Internet",
               'Possessions - literature':"Poss_Literature", 
               'Possessions - poetry':"Poss_Poetry", 
               'Possessions - art':"Poss_Art",
               'Possessions - textbooks':"Poss_Textbooks", 
               'Possessions - <technical reference books>':"Poss_Technical_Ref_Book",
               'Possessions - dictionary':"Poss_Dict", 
               'Possessions - dishwasher':"Poss_Dishwasher",
               'Possessions - <DVD>':"Poss_DVD", 
               'How many - cellular phones':"Count_Cell_Phone",
               'How many - televisions':"Count_TV", 
               'How many - computers':"Count_Computer", 
               'How many - cars':"Count_Cars",
               'How many - rooms bath or shower':"Count_Bath_Rooms", 
               'How many books at home':"Count_Books",
               'Out-of-School Study Time - Homework':"Homework_Time",
               'Out-of-School Study Time - Guided Homework':"Guided_Homework_Time",
               'Out-of-School Study Time - Personal Tutor':"Personal_Tutor_Time",
               'Out-of-School Study Time - Commercial Company':"Commercial_Company_Time",
               'Out-of-School Study Time - With Parent':"Study_Time_with_Parent",
               'Learning time (minutes per week)  - <test language>':"Learn_Time_Test_Language",
               'Learning time (minutes per week)- <Mathematics>':"Learn_Time_Test_Math",
               'Learning time (minutes per week) - <Science>':"Learn_Time_Test_Science",
               'Plausible value 1 in mathematics':"Value_1_Math", 
               'Plausible value 2 in mathematics':"Value_2_Math",
               'Plausible value 3 in mathematics':"Value_3_Math", 
               'Plausible value 4 in mathematics':"Value_4_Math",
               'Plausible value 5 in mathematics':"Value_5_Math", 
               'Plausible value 1 in reading':"Value_1_Read",
               'Plausible value 2 in reading':"Value_2_Read", 
               'Plausible value 3 in reading':"Value_3_Read",
               'Plausible value 4 in reading':"Value_4_Read", 
               'Plausible value 5 in reading':"Value_5_Read",
               'Plausible value 1 in science':"Value_1_Science", 
               'Plausible value 2 in science':"Value_2_Science",
               'Plausible value 3 in science':"Value_3_Science", 
               'Plausible value 4 in science':"Value_4_Science",
               'Plausible value 5 in science':"Value_5_Science"}

pisa.rename(rename_dict, axis=1, inplace=True)
In [27]:
pisa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Data columns (total 56 columns):
Country_Code                  485490 non-null object
Student_ID                    485490 non-null int64
International_Grade           485490 non-null int64
Birth_Month                   485490 non-null int64
Birth_Year                    485490 non-null int64
Sex                           485490 non-null object
Late_For_School               479143 non-null object
Skip_Whole_School_Day         479131 non-null object
Skip_Classes_Within_School    479269 non-null float64
Mother_Highest_Schooling      457979 non-null object
Mother_Job_Status             467751 non-null object
Father_Highest_Schooling      443261 non-null object
Father_Job_Status             451410 non-null object
Poss_Desk                     473079 non-null object
Poss_Own_Room                 469693 non-null object
Poss_Study_Place              472020 non-null object
Poss_Computer                 473877 non-null object
Poss_Software                 463178 non-null object
Poss_Internet                 473182 non-null object
Poss_Literature               465860 non-null object
Poss_Poetry                   467094 non-null object
Poss_Art                      467249 non-null object
Poss_Textbooks                471242 non-null object
Poss_Technical_Ref_Book       463566 non-null object
Poss_Dict                     474039 non-null object
Poss_Dishwasher               469115 non-null object
Poss_DVD                      474076 non-null object
Count_Cell_Phone              477079 non-null object
Count_TV                      476548 non-null object
Count_Computer                473459 non-null object
Count_Cars                    472499 non-null object
Count_Bath_Rooms              469643 non-null object
Count_Books                   473765 non-null object
Homework_Time                 301367 non-null float64
Guided_Homework_Time          269808 non-null float64
Personal_Tutor_Time           283813 non-null float64
Commercial_Company_Time       279657 non-null float64
Study_Time_with_Parent        289502 non-null float64
Learn_Time_Test_Language      282866 non-null float64
Learn_Time_Test_Math          283303 non-null float64
Learn_Time_Test_Science       270914 non-null float64
Value_1_Math                  485490 non-null float64
Value_2_Math                  485490 non-null float64
Value_3_Math                  485490 non-null float64
Value_4_Math                  485490 non-null float64
Value_5_Math                  485490 non-null float64
Value_1_Read                  485490 non-null float64
Value_2_Read                  485490 non-null float64
Value_3_Read                  485490 non-null float64
Value_4_Read                  485490 non-null float64
Value_5_Read                  485490 non-null float64
Value_1_Science               485490 non-null float64
Value_2_Science               485490 non-null float64
Value_3_Science               485490 non-null float64
Value_4_Science               485490 non-null float64
Value_5_Science               485490 non-null float64
dtypes: float64(24), int64(4), object(28)
memory usage: 207.4+ MB

For later easier wrangling of related group of columns, store column names that match to the categories below as list type.

In [28]:
scores = []
learning_time = []
counts = []
possessions = []
truancy = []
parents = []
base_info = []

for key, value in rename_dict.items():
    if key.startswith("Plausible"):
        scores.append(value)
    elif key.startswith("Learning") or key.startswith("Out-of-School"):
        learning_time.append(value)
    elif key.startswith("How many"):
        counts.append(value)
    elif key.startswith("Possessions"):
        possessions.append(value)
    elif key.startswith("Truancy"):
        truancy.append(value)
    elif key.startswith("Mother") or key.startswith("Father"):
        parents.append(value)
    else:
        base_info.append(value)

IV. Data Wrangling: Tidiness and Quality Issues


Issues:

  1. All columns in the possessions list are of type object (string), convert them to categoric.
  2. All columns in the counts list contain missing information, as this is not restorable, drop rows with NaN.
  3. All columns in the counts list are of type object (string), convert them to categoric with order (increasing).
  4. The columns Late_For_School and Skip_Whole_School_Day are of type object (string), convert them to sorted categorical.
  5. All columns in the parents list contain NaN values. As these don't appear to mean None (as it turns out as its own value inside this column) we cannot know what these values could mean, most likely the were not explicitly recorded as None, or the meant something totally different. In any case drop them to avoid biasing our results based on assuming their meaning.
  6. The table currently misses some statistically relevant columns like average grading or average / total learning time. Create them.
  7. Sex column is of type object, convert it to category.

  1. All columns in the possessions list are of type object (string), convert them to categoric.
In [29]:
for pos in possessions:
    pisa[pos] = pisa[pos].astype("category")
In [30]:
print(pisa[counts].isna().any())
for count in counts:
    pisa = pisa.loc[pisa[count].notnull(),:]
print("-------------------------")
print(pisa[counts].isna().any())
Count_Cell_Phone    True
Count_TV            True
Count_Computer      True
Count_Cars          True
Count_Bath_Rooms    True
Count_Books         True
dtype: bool
-------------------------
Count_Cell_Phone    False
Count_TV            False
Count_Computer      False
Count_Cars          False
Count_Bath_Rooms    False
Count_Books         False
dtype: bool

  1. All columns in the counts list contain missing information, as this is not restorable, drop rows with NaN.
  2. All columns in the counts list are of type object (string), convert them to categoric with order (increasing).
In [31]:
for count in counts:
    print(count, "--->", pisa[count].unique())
Count_Cell_Phone ---> ['Two' 'Three or more' 'One' 'None']
Count_TV ---> ['One' 'Three or more' 'Two' 'None']
Count_Computer ---> ['None' 'Three or more' 'Two' 'One']
Count_Cars ---> ['None' 'Two' 'One' 'Three or more']
Count_Bath_Rooms ---> ['None' 'Two' 'One' 'Three or more']
Count_Books ---> ['0-10 books ' '201-500 books ' 'More than 500 books' '11-25 books '
 '101-200 books ' '26-100 books ']
In [32]:
count_order = ["None", "One", "Two", "Three or more"]
count_classes = pd.api.types.CategoricalDtype(categories = count_order, ordered=True)

count_book_order = ['0-10 books ', '11-25 books ', '26-100 books ', '101-200 books ', '201-500 books ', 'More than 500 books']
count_book_classes = pd.api.types.CategoricalDtype(categories = count_book_order, ordered=True)

for count in counts:
    if sorted(count_order) == sorted(pisa[count].unique()):
        pisa[count] = pisa[count].astype(count_classes)
    else:
        pisa[count] = pisa[count].astype(count_book_classes)

  1. The columns Late_For_School and Skip_Whole_School_Day are of type object (string), convert them to sorted categorical.
In [33]:
pisa.Late_For_School.unique()
Out[33]:
array(['None  ', 'One or two times  ', 'Three or four times  ',
       'Five or more times  ', nan], dtype=object)
In [34]:
pisa = pisa.loc[pisa.Late_For_School.notnull(),:]
late_for_school_order = list(pisa.Late_For_School.unique())
late_for_school_classes = pd.api.types.CategoricalDtype(categories = late_for_school_order, ordered=True)
pisa["Late_For_School"] = pisa["Late_For_School"].astype(late_for_school_classes)
In [35]:
pisa = pisa.loc[pisa.Skip_Whole_School_Day.notnull(),:]
skip_whole_school_day_order = list(pisa.Skip_Whole_School_Day.unique())
skip_whole_school_day_classes = pd.api.types.CategoricalDtype(categories = skip_whole_school_day_order, ordered=True)
pisa["Skip_Whole_School_Day"] = pisa["Skip_Whole_School_Day"].astype(skip_whole_school_day_classes)

  1. All columns in the parents list contain NaN values. As these don't appear to mean None (as it turns out as its own value inside this column) we cannot know what these values could mean, most likely the were not explicitly recorded as None, or the meant something totally different. In any case drop them to avoid biasing our results based on assuming their meaning.
In [36]:
pisa = pisa.loc[pisa.Mother_Highest_Schooling.notnull(),:]
pisa = pisa.loc[pisa.Father_Highest_Schooling.notnull(),:]
pisa = pisa.loc[pisa.Mother_Job_Status.notnull(),:]
pisa = pisa.loc[pisa.Father_Job_Status.notnull(),:]

pisa["Mother_Highest_Schooling"].replace('She did not complete <ISCED level 1> ', "<ISCED level 0>", inplace=True)
pisa["Father_Highest_Schooling"].replace('He did not complete <ISCED level 1> ', "<ISCED level 0>", inplace=True)

print(pisa.Mother_Highest_Schooling.unique())
print(pisa.Father_Highest_Schooling.unique())
['<ISCED level 3A> ' '<ISCED level 3B, 3C> ' '<ISCED level 0>'
 '<ISCED level 2> ' '<ISCED level 1> ']
['<ISCED level 3A> ' '<ISCED level 3B, 3C> ' '<ISCED level 2> '
 '<ISCED level 0>' '<ISCED level 1> ']
In [37]:
isced_levels = sorted(pisa.Mother_Highest_Schooling.unique())
isced_classes = pd.api.types.CategoricalDtype(categories=isced_levels, ordered=True)
In [38]:
for par in parents:
    if str(par).find("Job_Status"):
        pisa[par] = pisa[par].astype("category")
    else:
        pisa[par] = pisa[par].astype(isced_classes)        

  1. The table currently misses some statistically relevant columns like average grading or average / total learning time. Create them.
     1. Create a "Total_Learning_Time" column that sums up all variables contained in the learning_time list
     2. Create average columns for math, science and reading scores: "math_avg_score", "read_avg_score" and "science_avg_score"

  1. Create a Total_Learning_Time column that sums up all variables contained in the learning_time list.
In [39]:
pisa[learning_time].head()
Out[39]:
Homework_Time Guided_Homework_Time Personal_Tutor_Time Commercial_Company_Time Study_Time_with_Parent Learn_Time_Test_Language Learn_Time_Test_Math Learn_Time_Test_Science
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 6.0 0.0 0.0 315.0 270.0 90.0
2 NaN 6.0 6.0 7.0 2.0 300.0 NaN NaN
3 10.0 2.0 2.0 0.0 0.0 135.0 135.0 90.0
4 NaN NaN NaN NaN NaN NaN NaN NaN
In [40]:
pisa[learning_time] = pisa[learning_time].replace(np.nan, 0)
In [41]:
pisa["Total_Learning_Time"] = sum([pisa[x] for x in learning_time])
pisa["Total_Learning_Time"].head()
Out[41]:
0      0.0
1    681.0
2    321.0
3    374.0
4      0.0
Name: Total_Learning_Time, dtype: float64

  1. Create average columns for math, science and reading scores: math_avg_score, read_avg_score and science_avg_score
In [42]:
scores
Out[42]:
['Value_1_Math',
 'Value_2_Math',
 'Value_3_Math',
 'Value_4_Math',
 'Value_5_Math',
 'Value_1_Read',
 'Value_2_Read',
 'Value_3_Read',
 'Value_4_Read',
 'Value_5_Read',
 'Value_1_Science',
 'Value_2_Science',
 'Value_3_Science',
 'Value_4_Science',
 'Value_5_Science']
In [43]:
pisa["global_score"] = sum(pisa[score] for score in scores) / len(scores)
In [44]:
pisa["math_score"] = sum(pisa[score] for score in scores[:5]) / len(scores[:5])
pisa["read_score"] = sum(pisa[score] for score in scores[5:10]) / len(scores[5:10])
pisa["science_score"] = sum(pisa[score] for score in scores[10:]) / len(scores[10:])
In [45]:
print(pisa.global_score.head(3))
print("--------------")
print(pisa.math_score.head(3))
print("--------------")
print(pisa.read_score.head(3))
print("--------------")
print(pisa.science_score.head(3))
0    333.038020
1    444.458700
2    465.776413
Name: global_score, dtype: float64
--------------
0    366.18634
1    470.56396
2    505.53824
Name: math_score, dtype: float64
--------------
0    261.01424
1    384.68832
2    405.18154
Name: read_score, dtype: float64
--------------
0    371.91348
1    478.12382
2    486.60946
Name: science_score, dtype: float64
In [46]:
print(pisa.global_score.describe())
print("----------------------")
print(pisa.math_score.describe())
print("----------------------")
print(pisa.read_score.describe())
print("----------------------")
print(pisa.science_score.describe())
count    397816.000000
mean        482.394740
std          93.191618
min          88.394973
25%         415.573887
50%         482.238130
75%         549.406933
max         826.592027
Name: global_score, dtype: float64
----------------------
count    397816.000000
mean        479.556236
std          98.877418
min         101.347560
25%         407.080520
50%         476.172300
75%         548.847180
max         903.107960
Name: math_score, dtype: float64
----------------------
count    397816.000000
mean        482.073496
std          95.075826
min          38.813920
25%         417.175620
50%         485.318920
75%         550.196260
max         849.359740
Name: read_score, dtype: float64
----------------------
count    397816.000000
mean        485.554490
std          95.216779
min          25.158540
25%         417.325620
50%         485.583740
75%         554.214820
max         857.832900
Name: science_score, dtype: float64

  1. Sex column is of type object, convert it to category.
In [47]:
pisa["Sex"] = pisa.Sex.astype("category")

IV. Final Checkup: Data quality, tidiness

In [48]:
pisa.describe()
Out[48]:
Student_ID International_Grade Birth_Month Birth_Year Skip_Classes_Within_School Homework_Time Guided_Homework_Time Personal_Tutor_Time Commercial_Company_Time Study_Time_with_Parent ... Value_1_Science Value_2_Science Value_3_Science Value_4_Science Value_5_Science Total_Learning_Time global_score math_score read_score science_score
count 397816.000000 397816.000000 397816.000000 397816.000000 397175.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 ... 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000 397816.000000
mean 6216.843460 9.811068 6.548751 1996.068243 1.253037 3.597827 0.933816 0.563529 0.538540 0.728912 ... 485.523312 485.559302 485.610548 485.531903 485.547384 396.442506 482.394740 479.556236 482.073496 485.554490
std 6808.278415 3.384215 3.706446 0.252162 0.562006 5.145149 2.130458 1.719550 1.874386 1.902097 ... 98.704537 98.770978 98.746420 98.789864 98.843054 372.852047 93.191618 98.877418 95.075826 95.216779
min 1.000000 7.000000 1.000000 1996.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 6.844500 2.834800 11.879900 8.429700 17.754600 0.000000 88.394973 101.347560 38.813920 25.158540
25% 1827.000000 9.000000 4.000000 1996.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 415.926900 415.926900 415.926900 415.740400 415.647100 0.000000 415.573887 407.080520 417.175620 417.325620
50% 3771.000000 10.000000 7.000000 1996.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 ... 485.583700 485.583700 485.583700 485.677000 485.583700 453.000000 482.238130 476.172300 485.318920 485.583740
75% 7610.000000 10.000000 9.000000 1996.000000 1.000000 5.000000 1.000000 0.000000 0.000000 1.000000 ... 555.520300 555.520300 555.520300 555.520300 555.520300 663.000000 549.406933 548.847180 550.196260 554.214820
max 33806.000000 96.000000 99.000000 1997.000000 4.000000 30.000000 30.000000 30.000000 30.000000 30.000000 ... 903.338300 900.540800 867.624000 926.557300 880.958600 4200.000000 826.592027 903.107960 849.359740 857.832900

8 rows × 33 columns

In [49]:
pisa.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 397816 entries, 0 to 485489
Data columns (total 61 columns):
Country_Code                  397816 non-null object
Student_ID                    397816 non-null int64
International_Grade           397816 non-null int64
Birth_Month                   397816 non-null int64
Birth_Year                    397816 non-null int64
Sex                           397816 non-null category
Late_For_School               397816 non-null category
Skip_Whole_School_Day         397816 non-null category
Skip_Classes_Within_School    397175 non-null float64
Mother_Highest_Schooling      397816 non-null category
Mother_Job_Status             397816 non-null category
Father_Highest_Schooling      397816 non-null category
Father_Job_Status             397816 non-null category
Poss_Desk                     394628 non-null category
Poss_Own_Room                 395340 non-null category
Poss_Study_Place              393876 non-null category
Poss_Computer                 395459 non-null category
Poss_Software                 388309 non-null category
Poss_Internet                 395116 non-null category
Poss_Literature               390290 non-null category
Poss_Poetry                   390774 non-null category
Poss_Art                      391133 non-null category
Poss_Textbooks                393585 non-null category
Poss_Technical_Ref_Book       388662 non-null category
Poss_Dict                     395254 non-null category
Poss_Dishwasher               392339 non-null category
Poss_DVD                      395356 non-null category
Count_Cell_Phone              397816 non-null category
Count_TV                      397816 non-null category
Count_Computer                397816 non-null category
Count_Cars                    397816 non-null category
Count_Bath_Rooms              397816 non-null category
Count_Books                   397816 non-null category
Homework_Time                 397816 non-null float64
Guided_Homework_Time          397816 non-null float64
Personal_Tutor_Time           397816 non-null float64
Commercial_Company_Time       397816 non-null float64
Study_Time_with_Parent        397816 non-null float64
Learn_Time_Test_Language      397816 non-null float64
Learn_Time_Test_Math          397816 non-null float64
Learn_Time_Test_Science       397816 non-null float64
Value_1_Math                  397816 non-null float64
Value_2_Math                  397816 non-null float64
Value_3_Math                  397816 non-null float64
Value_4_Math                  397816 non-null float64
Value_5_Math                  397816 non-null float64
Value_1_Read                  397816 non-null float64
Value_2_Read                  397816 non-null float64
Value_3_Read                  397816 non-null float64
Value_4_Read                  397816 non-null float64
Value_5_Read                  397816 non-null float64
Value_1_Science               397816 non-null float64
Value_2_Science               397816 non-null float64
Value_3_Science               397816 non-null float64
Value_4_Science               397816 non-null float64
Value_5_Science               397816 non-null float64
Total_Learning_Time           397816 non-null float64
global_score                  397816 non-null float64
math_score                    397816 non-null float64
read_score                    397816 non-null float64
science_score                 397816 non-null float64
dtypes: category(27), float64(29), int64(4), object(1)
memory usage: 116.5+ MB
In [50]:
pisa.head(100)
Out[50]:
Country_Code Student_ID International_Grade Birth_Month Birth_Year Sex Late_For_School Skip_Whole_School_Day Skip_Classes_Within_School Mother_Highest_Schooling ... Value_1_Science Value_2_Science Value_3_Science Value_4_Science Value_5_Science Total_Learning_Time global_score math_score read_score science_score
0 Albania 1 10 2 1996 Female None None 1.0 <ISCED level 3A> ... 341.7009 408.8400 348.2283 367.8105 392.9877 0.0 333.038020 366.18634 261.01424 371.91348
1 Albania 2 10 2 1996 Female One or two times None 1.0 <ISCED level 3A> ... 548.9929 471.5964 471.5964 443.6218 454.8116 681.0 444.458700 470.56396 384.68832 478.12382
2 Albania 3 9 9 1996 Female None None 1.0 <ISCED level 3B, 3C> ... 499.6643 428.7952 492.2044 512.7191 499.6643 321.0 465.776413 505.53824 405.18154 486.60946
3 Albania 4 9 8 1996 Female None None 1.0 <ISCED level 3B, 3C> ... 438.6796 481.5740 448.9370 474.1141 426.5573 374.0 460.296973 449.45476 477.46376 453.97240
4 Albania 5 9 10 1996 Female One or two times None 2.0 <ISCED level 0> ... 361.5628 275.7740 372.7527 403.5248 422.1746 0.0 336.223953 385.50398 256.01010 367.15778
5 Albania 6 9 9 1996 Female One or two times None 1.0 <ISCED level 3B, 3C> ... 384.3156 358.2059 405.7628 403.8978 325.5688 24.0 359.383587 365.17370 337.42688 375.55018
6 Albania 7 10 3 1996 Female Three or four times None 1.0 <ISCED level 3B, 3C> ... 508.1499 462.4580 499.7576 519.3398 466.1880 460.0 451.144060 490.11528 372.13824 491.17866
7 Albania 8 10 6 1996 Male None None 1.0 <ISCED level 3A> ... 611.2831 596.3633 554.4013 513.3719 519.8993 0.0 539.268953 542.14830 516.59478 559.06378
9 Albania 10 10 8 1996 Female One or two times One or two times 2.0 <ISCED level 3A> ... 404.2708 343.6591 339.9291 335.2667 287.7098 618.0 392.655987 433.87600 401.92486 342.16710
10 Albania 11 10 5 1996 Female Three or four times None 1.0 <ISCED level 3A> ... 438.0269 455.7441 473.4614 466.0015 465.0690 504.0 483.364380 438.93910 551.49346 459.66058
11 Albania 12 10 5 1996 Female None None 1.0 <ISCED level 3A> ... 340.1156 296.2887 182.5252 189.9851 290.6938 242.0 245.435740 268.35178 208.03376 259.92168
12 Albania 13 10 3 1996 Female None None 1.0 <ISCED level 2> ... 150.5408 240.0597 217.6800 164.5282 199.0302 0.0 201.715953 231.89750 178.88258 194.36778
13 Albania 14 10 5 1996 Female None None 1.0 <ISCED level 3A> ... 489.5002 546.3819 501.6225 498.8251 490.4327 564.0 485.269047 448.20844 502.24622 505.35248
15 Albania 16 10 7 1996 Male None None 1.0 <ISCED level 2> ... 402.5923 359.6979 420.3096 344.7781 359.6979 0.0 373.299747 367.43262 375.05146 377.41516
16 Albania 17 10 7 1996 Female None One or two times 3.0 <ISCED level 2> ... 145.2257 167.6054 211.4323 301.8836 303.7486 0.0 220.288747 245.99622 188.89090 225.97912
17 Albania 18 10 2 1996 Male None None 1.0 <ISCED level 2> ... 511.0407 483.0660 522.2305 483.0660 504.5132 379.0 528.943653 527.27062 558.77706 500.78328
18 Albania 19 10 6 1996 Female None None 1.0 <ISCED level 3A> ... 350.3730 405.3898 346.6430 352.2380 345.7106 730.0 375.754100 416.81728 350.37414 360.07088
19 Albania 20 10 8 1996 Female None None 1.0 <ISCED level 3A> ... 509.2689 419.7501 408.5602 483.1593 438.3999 690.0 454.341487 453.19364 458.00314 451.82768
21 Albania 22 10 1 1996 Male Three or four times None 1.0 <ISCED level 2> ... 371.9135 483.8120 335.5464 439.9851 413.8754 0.0 414.138653 393.05968 440.32980 409.02648
22 Albania 23 10 6 1996 Female None None 1.0 <ISCED level 2> ... 277.3592 267.1018 274.5617 271.7643 222.3424 652.0 252.596973 230.41752 264.74752 262.62588
23 Albania 24 10 3 1996 Female None None 1.0 <ISCED level 2> ... 468.5192 465.7217 448.9370 440.5446 355.6882 0.0 430.376753 442.60008 412.64804 435.88214
24 Albania 25 10 5 1996 Male None None 1.0 <ISCED level 2> ... 315.8709 323.3308 290.6938 300.0186 294.4237 0.0 316.013547 314.15328 329.01980 304.86756
25 Albania 26 10 12 1996 Female None None 1.0 <ISCED level 2> ... 349.4405 381.1451 384.8750 395.1324 362.4953 412.0 370.775753 373.19676 364.51284 374.61766
26 Albania 27 10 3 1996 Female Three or four times None 1.0 <ISCED level 3A> ... 361.5628 447.3517 352.2380 399.7949 440.8243 481.0 394.579953 385.11450 398.27102 400.35434
27 Albania 28 10 5 1996 Male One or two times None 1.0 <ISCED level 3A> ... 459.1943 365.0131 405.1100 384.5953 364.0806 503.0 404.463320 405.28900 412.50230 395.59866
28 Albania 29 10 6 1996 Female One or two times None 1.0 <ISCED level 2> ... 268.1276 359.5114 273.7225 247.6128 297.9672 651.0 325.638487 364.23900 323.28816 289.38830
29 Albania 30 10 4 1996 Female None None 1.0 <ISCED level 3A> ... 542.4655 468.7990 509.8284 542.4655 579.7650 679.0 528.910520 521.50646 536.56042 528.66468
31 Albania 32 10 6 1996 Male None None 2.0 <ISCED level 1> ... 258.9892 272.9765 215.1623 235.6770 118.1835 0.0 222.264587 236.10376 210.49230 220.19770
33 Albania 34 10 4 1996 Male Five or more times None 1.0 <ISCED level 2> ... 499.5711 336.3857 470.6639 483.7188 376.4827 0.0 448.922627 445.32636 468.07708 433.36444
34 Albania 35 10 7 1996 Male One or two times None 2.0 <ISCED level 3A> ... 365.1996 381.9843 380.1194 485.4905 416.4864 410.0 383.789360 395.24070 350.27134 405.85604
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 Albania 85 10 8 1996 Female None None 1.0 <ISCED level 2> ... 489.5002 413.0362 439.1458 357.0869 412.1037 281.0 413.893580 402.25112 417.25506 422.17456
85 Albania 86 10 5 1996 Female None None 1.0 <ISCED level 3B, 3C> ... 337.2249 302.7229 340.0224 243.0436 290.6005 370.0 312.708240 316.87956 318.52230 302.72286
86 Albania 87 10 3 1996 Female None None 1.0 <ISCED level 3A> ... 238.4744 111.6561 122.8460 260.8542 119.1160 281.0 172.677033 217.40924 130.03252 170.58934
87 Albania 88 10 6 1996 Male None Three or four times 1.0 <ISCED level 3B, 3C> ... 332.6557 288.8288 389.5375 327.0608 189.0526 409.0 287.222900 319.60586 236.63576 305.42708
88 Albania 89 10 1 1996 Male None None 1.0 <ISCED level 3B, 3C> ... 452.2939 428.9817 399.1421 533.4204 414.9944 549.0 412.884927 379.58406 413.30422 445.76650
89 Albania 90 10 9 1996 Female None None 1.0 <ISCED level 2> ... 275.7740 356.9004 366.2253 287.8963 336.3857 318.0 321.398940 323.50054 316.05994 324.63634
91 Albania 92 10 6 1996 Female None None 1.0 <ISCED level 3A> ... 399.9814 403.7113 374.8042 325.3823 439.1458 0.0 406.208953 424.84032 405.18154 388.60500
92 Albania 93 10 8 1996 Male None None 1.0 <ISCED level 3B, 3C> ... 593.4726 660.6117 592.5401 696.0463 673.6665 551.0 621.290967 598.15392 622.45154 643.26744
94 Albania 95 10 6 1996 Male None None 1.0 <ISCED level 2> ... 481.2943 596.9228 549.3659 567.0832 597.8553 324.0 551.442480 557.41548 538.40766 558.50430
95 Albania 96 10 4 1996 Male None None 2.0 <ISCED level 2> ... 287.2436 267.6613 359.0451 290.9735 280.7162 365.0 309.174827 298.73034 331.66620 297.12794
96 Albania 97 9 6 1996 Male None None 1.0 <ISCED level 3A> ... 462.1783 479.8956 468.7057 415.5539 468.7057 506.0 443.712993 479.67750 392.45364 459.00784
97 Albania 98 10 9 1996 Male None None 1.0 <ISCED level 3B, 3C> ... 521.0183 512.6259 510.7609 568.5752 554.5878 285.0 538.715487 515.27498 567.35786 533.51362
100 Albania 101 10 9 1996 Female One or two times None 1.0 <ISCED level 3A> ... 466.6542 403.2451 392.0552 418.1649 445.2070 180.0 422.420713 443.84640 398.35046 425.06528
103 Albania 104 10 2 1996 Female None None 1.0 <ISCED level 3B, 3C> ... 300.9511 455.7441 327.9933 284.1664 376.4827 369.0 355.035260 352.39914 363.63912 349.06752
104 Albania 105 9 11 1996 Female One or two times One or two times 1.0 <ISCED level 3B, 3C> ... 332.3760 245.6546 313.7262 240.9922 341.7009 390.0 329.339940 366.10842 327.02142 294.88998
106 Albania 107 10 8 1996 Male None None 1.0 <ISCED level 3B, 3C> ... 522.3238 502.7415 533.5136 516.7288 541.9060 0.0 486.520400 507.09612 429.02234 523.44274
109 Albania 110 10 12 1996 Male One or two times None 1.0 <ISCED level 3A> ... 205.8374 207.7023 150.8206 235.6770 168.5379 0.0 180.501213 187.49804 160.29056 193.71504
111 Albania 112 10 5 1996 Female None One or two times 1.0 <ISCED level 3A> ... 441.8501 438.1201 415.7404 359.7911 411.0780 0.0 406.192253 375.45568 429.80514 413.31594
112 Albania 113 10 7 1996 Male One or two times None 1.0 <ISCED level 3A> ... 562.0477 498.6386 498.6386 547.1279 510.7609 640.0 543.186900 556.40286 549.71510 523.44274
113 Albania 114 10 3 1996 Female One or two times None 1.0 <ISCED level 3A> ... 405.3898 306.5461 385.8075 428.7020 397.9299 0.0 417.151533 403.73110 462.84844 384.87506
115 Albania 116 10 10 1996 Female None None 1.0 <ISCED level 3A> ... 523.8157 515.4233 491.1787 392.3349 560.1828 410.0 480.055873 462.46298 481.11756 496.58708
116 Albania 117 10 4 1996 Male One or two times None 2.0 <ISCED level 3A> ... 430.5670 424.0395 428.7020 352.2380 533.1406 0.0 448.149373 449.61054 461.10016 433.73742
118 Albania 119 10 6 1996 Female One or two times None 1.0 <ISCED level 3A> ... 425.2518 450.4289 424.3193 448.5640 414.0619 465.0 428.060807 366.96524 484.69200 432.52518
119 Albania 120 10 2 1996 Female One or two times None 1.0 <ISCED level 2> ... 389.9105 367.5308 237.9150 363.8008 332.0962 0.0 332.674667 305.50708 354.26626 338.25066
120 Albania 121 10 6 1996 Female None One or two times 1.0 <ISCED level 2> ... 268.5938 270.4588 306.8258 249.9441 239.6867 0.0 279.489647 265.23604 306.13106 267.10184
121 Albania 122 10 3 1996 Female None None 2.0 <ISCED level 2> ... 491.1787 448.2842 464.1365 506.0985 397.9299 367.0 473.222713 449.06528 509.07730 461.52556
122 Albania 123 10 8 1996 Female None None 1.0 <ISCED level 3A> ... 486.5162 487.4487 467.8665 498.6386 459.4741 0.0 519.120687 509.19926 568.17398 479.98882
123 Albania 124 10 5 1996 Male One or two times None 1.0 <ISCED level 3A> ... 366.8780 379.9329 346.3633 327.7135 323.0511 448.0 324.807540 302.46924 323.16562 348.78776
124 Albania 125 10 4 1996 Male None None 1.0 <ISCED level 3A> ... 432.5252 415.7404 337.4114 411.0780 384.0358 369.0 385.323280 359.09800 400.71368 396.15816
125 Albania 126 10 2 1996 Male None None 1.0 <ISCED level 2> ... 382.7303 304.4013 429.3547 400.4476 439.6121 0.0 369.983967 335.65196 382.99074 391.30920

100 rows × 61 columns

In [51]:
pisa.dropna(axis=0, inplace=True)
len(pisa.columns)
Out[51]:
61

Exploratory Data Analyis


  1. Country Distributions Analyse the amount of students participating in the PISA2012 study dataset to get a rough distribution at which popoulations we look at here.
In [52]:
order = pisa["Country_Code"].value_counts()
In [53]:
plt.style.use("seaborn");
plt.figure(figsize=(10,10));
sns.countplot(y=pisa["Country_Code"], order = order.index, color="red");
plt.savefig("Countries.png")

Key findings
Italy and Mexico are the countries where most of the students have participated. Overall participation seems not to be related to number of inhabitants in the repsective country, as Germany for instance with its 80 million of inhabitants is on the lower level of the chart. It also appears, that the USA are split into three regions (Florida, Massachusetts and Connecticut), which does not appear to be really intuitive.


  1. Data variable correlations
    Analyse any correlation between the data. Therefore convert categorical columns into numeric.
In [54]:
pisa.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 376339 entries, 0 to 485488
Data columns (total 61 columns):
Country_Code                  376339 non-null object
Student_ID                    376339 non-null int64
International_Grade           376339 non-null int64
Birth_Month                   376339 non-null int64
Birth_Year                    376339 non-null int64
Sex                           376339 non-null category
Late_For_School               376339 non-null category
Skip_Whole_School_Day         376339 non-null category
Skip_Classes_Within_School    376339 non-null float64
Mother_Highest_Schooling      376339 non-null category
Mother_Job_Status             376339 non-null category
Father_Highest_Schooling      376339 non-null category
Father_Job_Status             376339 non-null category
Poss_Desk                     376339 non-null category
Poss_Own_Room                 376339 non-null category
Poss_Study_Place              376339 non-null category
Poss_Computer                 376339 non-null category
Poss_Software                 376339 non-null category
Poss_Internet                 376339 non-null category
Poss_Literature               376339 non-null category
Poss_Poetry                   376339 non-null category
Poss_Art                      376339 non-null category
Poss_Textbooks                376339 non-null category
Poss_Technical_Ref_Book       376339 non-null category
Poss_Dict                     376339 non-null category
Poss_Dishwasher               376339 non-null category
Poss_DVD                      376339 non-null category
Count_Cell_Phone              376339 non-null category
Count_TV                      376339 non-null category
Count_Computer                376339 non-null category
Count_Cars                    376339 non-null category
Count_Bath_Rooms              376339 non-null category
Count_Books                   376339 non-null category
Homework_Time                 376339 non-null float64
Guided_Homework_Time          376339 non-null float64
Personal_Tutor_Time           376339 non-null float64
Commercial_Company_Time       376339 non-null float64
Study_Time_with_Parent        376339 non-null float64
Learn_Time_Test_Language      376339 non-null float64
Learn_Time_Test_Math          376339 non-null float64
Learn_Time_Test_Science       376339 non-null float64
Value_1_Math                  376339 non-null float64
Value_2_Math                  376339 non-null float64
Value_3_Math                  376339 non-null float64
Value_4_Math                  376339 non-null float64
Value_5_Math                  376339 non-null float64
Value_1_Read                  376339 non-null float64
Value_2_Read                  376339 non-null float64
Value_3_Read                  376339 non-null float64
Value_4_Read                  376339 non-null float64
Value_5_Read                  376339 non-null float64
Value_1_Science               376339 non-null float64
Value_2_Science               376339 non-null float64
Value_3_Science               376339 non-null float64
Value_4_Science               376339 non-null float64
Value_5_Science               376339 non-null float64
Total_Learning_Time           376339 non-null float64
global_score                  376339 non-null float64
math_score                    376339 non-null float64
read_score                    376339 non-null float64
science_score                 376339 non-null float64
dtypes: category(27), float64(29), int64(4), object(1)
memory usage: 120.2+ MB
In [55]:
pisa_corr = pisa.copy()
In [56]:
pisa_corr.Sex.replace({"Female":1, "Male":0}, inplace=True)
In [57]:
for pos in possessions:
    pisa_corr[pos].replace({"Yes":1, "No":0}, inplace=True)
In [58]:
for count in counts:
    print(pisa_corr[count].unique())
[Two, Three or more, One, None]
Categories (4, object): [None < One < Two < Three or more]
[One, Three or more, Two, None]
Categories (4, object): [None < One < Two < Three or more]
[None, Three or more, Two, One]
Categories (4, object): [None < One < Two < Three or more]
[None, Two, One, Three or more]
Categories (4, object): [None < One < Two < Three or more]
[None, Two, One, Three or more]
Categories (4, object): [None < One < Two < Three or more]
[0-10 books, 201-500 books, More than 500 books, 11-25 books, 101-200 books, 26-100 books]
Categories (6, object): [0-10 books < 11-25 books < 26-100 books < 101-200 books < 201-500 books < More than 500 books]
In [59]:
for count in counts[:-1]:
    pisa[count].replace({"None":0, "One":1, "Two":2, "Three or more": 3}, inplace=True)
In [60]:
plt.figure(figsize=(30,30))
fig = sns.heatmap(pisa_corr.corr(),
           cmap="coolwarm", annot=True, fmt=".2f",
           linewidths=0.05, linecolor="black");
fig.patch.set_edgecolor("black");
fig.patch.set_linewidth("1")  
plt.savefig("Correlation_plot.png")

Key findings
As expected, score results are highly correlated between each other. Students achieving high test results in science, also trend to reach good grades in math etc. Also some other obvios trends can be obtained from this plot, for instance students possing a computer are highly likely to also have internet. Though this doesnt appear too relevant for our project here, it can help gaining confidence that our multiple data wrangling steps did not introduce any errors into our dataset.


  1. Univariate Data Exploration Explore the distribution of math, science and reading results.
In [61]:
plt.style.use("seaborn")

##### reset to default plt style #####
#plt.rcParams.update(plt.rcParamsDefault)
#%matplotlib inline
#%config InlineBackend.figure_format = "retina"
######################################

plt.figure(figsize=(15,12));
plt.subplot(2,2,1);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="global_score", bins=bins, color="lightgreen");
plt.title("Histogram of Global Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");

plt.subplot(2,2,2);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="math_score", bins=bins, color="lightblue");
plt.title("Histogram of Math Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");

plt.subplot(2,2,3);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="read_score", bins=bins, color="lightblue");
plt.title("Histogram of Reading Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");

plt.subplot(2,2,4);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="science_score", bins=bins, color="lightblue");
plt.title("Histogram of Science Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");

plt.savefig("Histogram_Scores.png")

Key findings
All test results appear to be normally distributed, with a center around values of 500 points. No skewness is observed. This would fit the central limit theorem (CLT), that the distribution of datasets with large sample sizes trend towards a almost normal distribution.


  1. Univariate Data Exploration Explore the distribution of study times.
In [62]:
axes = []
fig = plt.figure(figsize=(20,10));
n_rows = 2
n_cols = 4

for idx, col in enumerate(learning_time):
    axes.append(fig.add_subplot(n_rows, n_cols, idx+1));
    data = pisa[col]
    bins = np.arange(0, data.max() + (data.max()/10), 1)
    if col.startswith("Learn"):
        bins = np.arange(0, data.max() + (data.max()/10), 100)
    plt.hist(data, color="lightblue", bins=bins)
    plt.title(col)
    plt.yscale("log")
    plt.xlabel("Time in h")
    plt.ylabel("# of students")
    plt.yticks([10,100,1000,10000,100000,1e6], ["10","100","1k","10k","100k","1m"])
plt.savefig("Study_Time_Histograms.png")

Key findings
As expected, learning times / study times for all categories show a negative correlation to the number of students that fall into each category (wacth out for the logarithmic scale on the y-axis).


  1. Total Learning Time: Are girls spending more time learning for school than boys?
    Create a histogram of the Total Learning Time, separated for boys and girls
In [63]:
gender_diffs = pisa.groupby("Sex")["Total_Learning_Time"]
gender_diffs.agg(["mean", "median", "std"])
Out[63]:
mean median std
Sex
Female 401.027863 456.0 370.941748
Male 396.970614 452.0 375.734790
In [64]:
bins = np.linspace(0, np.max(pisa.Total_Learning_Time)+10,30)

plt.figure(figsize=[4,4])
fig = plt.hist(x = pisa.query("Sex == 'Female'")["Total_Learning_Time"], bins = bins, color="lightgreen", alpha=1, label="Female")
fig = plt.hist(x = pisa.query("Sex == 'Male'")["Total_Learning_Time"], bins = bins, color="lightblue", alpha=1, label="Male")
plt.xlim(0, 2500);
plt.yticks([1e4, 2e4, 3e4, 4e4, 5e4, 6e4, 7e4, 8e4], ["10k","20k","30k","40k","50k","60k","70k", "80k"]);
plt.ylabel("Counts in thousands (1k=1000)");
plt.xlabel("Total Learning Time in hours");
plt.legend(loc="upper right");
plt.title("Learning time separated for boys and girls");
plt.savefig("Learning_Time_Boys_Girls.png")
In [65]:
axes = []
nrows = 2
ncols = 4

fig = plt.figure(figsize=(30,10))

for idx, col in enumerate(learning_time):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    plt.title("Heatmap comparing Global Test Scores with '{}'".format(col.replace("_", " ")))
    plt.hist2d(data=pisa, x=col, y="global_score", cmap="mako_r", cmin=0.8)
plt.colorbar();
plt.savefig("Heatmap_Learning_Times.png")

Key findings
Female participants trend to have a slightly higher motivation to spend time learning for school than boys have. To validate this for statistical significance we could either use t-test or bootstrapping.


  1. Possessions What is the distribution of possessing items like Computer, a dishwasher and a own room?
    Create a clustered bar chart of the possession items.
In [66]:
fig = plt.figure(figsize=(30,10))

axes = []
nrows = 2
ncols = 7

for idx, col in enumerate(possessions):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"], order=["Yes","No"])
    plt.yticks([x for x in np.arange(0, 200000, 20000)], ["{}k".format(str(x)[:-3]) if x!=0 else str(x) for x in np.arange(0, 200000, 20000)]);
    plt.title(col.replace("_", " "))
    plt.xlabel(None)
    plt.ylabel("# Counts (1k=1000)")
plt.savefig("Possessions.png")
In [67]:
pisa.groupby("Sex")[possessions].agg("count")
Out[67]:
Poss_Desk Poss_Own_Room Poss_Study_Place Poss_Computer Poss_Software Poss_Internet Poss_Literature Poss_Poetry Poss_Art Poss_Textbooks Poss_Technical_Ref_Book Poss_Dict Poss_Dishwasher Poss_DVD
Sex
Female 191830 191830 191830 191830 191830 191830 191830 191830 191830 191830 191830 191830 191830 191830
Male 184509 184509 184509 184509 184509 184509 184509 184509 184509 184509 184509 184509 184509 184509

Key findings
I here used barplots to count the number of students possessing different items, separated by gender. Girls tend to have more poetry, art and text books than boys, while the latter tend to have more technical reference books.


  1. Counts What is the distribution count of numbers of items like Computer, Cars and bath rooms?
    Create a clustered bar chart of the counts items.
In [68]:
counted = pisa.groupby("Sex")[counts]
counted.agg(["mean", "median", "min", "max", "std"])
Out[68]:
Count_Cell_Phone Count_TV ... Count_Cars Count_Bath_Rooms
mean median min max std mean median min max std ... mean median min max std mean median min max std
Sex
Female 2.791894 3 0 3 0.560956 2.173534 2 0 3 0.805790 ... 1.404535 1 0 3 0.964312 1.461049 1 0 3 0.846842
Male 2.769832 3 0 3 0.599468 2.256611 2 0 3 0.794535 ... 1.452406 1 0 3 0.968613 1.488551 1 0 3 0.835926

2 rows × 25 columns

In [69]:
fig = plt.figure(figsize=(20,6))

axes = []
nrows = 2
ncols = 3

for idx, col in enumerate(counts):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    
    if col != "Count_Books":
        sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"], order=[0,1,2,3])
    else:
        sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"])
        plt.xticks(rotation = 10)

    plt.yticks([x for x in np.arange(0, 200000, 20000)], ["{}k".format(str(x)[:-3]) if x!=0 else str(x) for x in np.arange(0, 200000, 20000)]);
    plt.title(col.replace("_", " "))
    plt.ylabel("# Counts (1k=1000)")
    plt.xlabel(None)
    
plt.savefig("Counts.png")
In [70]:
pisa.groupby("Count_Bath_Rooms")["global_score"].mean()
Out[70]:
Count_Bath_Rooms
0    428.615180
1    481.835045
2    503.495792
3    495.684616
Name: global_score, dtype: float64
In [71]:
fig = plt.figure(figsize=(20,10))

nrows = 2
ncols = 3
axes = []

for idx, col in enumerate(counts):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    sns.pointplot(data=pisa, x=col, y="global_score", hue="Sex", palette=["lightgreen", "lightblue"]);
    plt.ylim([0, 600]);
plt.suptitle("Correlation between score results and number of items")
plt.savefig("Counts_Scores.png")

Key findings
The latter plot is quite interesting. The more bathrooms a student has in his house the better his score results are (while the last shows a slight decrease). This might largely be explained, that number of bathrooms is an indicator of luxury, thus studens like these might have a better access to education, resources like computer and heavily expensive school textbooks. All these trends also occur for number of cars, cell phones at home and computer.


  1. Correlations: Is there any correlation between count of "luxury" items and grade?
    Use seaborns PairGrid to explore any possible trends.
In [72]:
variables = []
variables.extend(counts)
variables.extend(["global_score", "math_score", "science_score", "read_score"])
variables
Out[72]:
['Count_Cell_Phone',
 'Count_TV',
 'Count_Computer',
 'Count_Cars',
 'Count_Bath_Rooms',
 'Count_Books',
 'global_score',
 'math_score',
 'science_score',
 'read_score']
In [73]:
pisa_subset = pisa.sample(500, replace=True)

g = sns.PairGrid(data = pisa_subset, vars = variables);
g.map_diag(plt.hist, color="lightblue");
g.map_offdiag(plt.scatter, color="lightgreen");
In [74]:
pisa.Count_Books.dtype
Out[74]:
CategoricalDtype(categories=['0-10 books ', '11-25 books ', '26-100 books ',
                  '101-200 books ', '201-500 books ', 'More than 500 books'],
                 ordered=True)
In [75]:
pisa["Count_Books_Numeric"] = pisa["Count_Books"].replace({"0-10 books ": np.mean((0,10)),
                                 "11-25 books ": np.mean((11,25)),
                                 "26-100 books ": np.mean((26,100)),
                                 "101-200 books ": np.mean((101,200)),
                                 "201-500 books ": np.mean((201,500)),
                                 "More than 500 books": 501})
variables.append("Count_Books_Numeric")
pisa["Count_Books_Numeric"].head()
Out[75]:
0      5.0
1    350.5
2    501.0
3     18.0
4    150.5
Name: Count_Books_Numeric, dtype: float64
In [76]:
palette = sns.color_palette("Blues")
plt.figure(figsize=(10,10))
fig = sns.heatmap(pisa[variables].corr(),
           cmap=palette, annot=True, fmt=".2f", vmin=0.2);
fig.patch.set_edgecolor("black");
fig.patch.set_linewidth("1")  
plt.title("Correlation heatmap between count of luxury items and global test scores")
plt.xticks(rotation = 20)
plt.savefig("Correlation_plot_possession_scores.png")

  1. Trends: Do students with more books achieve better results?
In [77]:
sns.barplot(data=pisa, x="Count_Books", y="global_score", palette=["lightgreen", "lightblue"], ci="sd", hue="Sex", errwidth = 0.75);
plt.xlabel(None);
plt.xticks(rotation=20)
plt.ylabel("Global Test Score Result");
plt.title("Comparison between # of Books and Global Test Results separated by Gender");
plt.savefig("Books_Global_Scores.png")

Key findings
Not surprisingly, the number of books shows a linear trend to simultaneously rise when test scores rise. As we cannot conclude any direction from the data, it is however obvious that the more books students will own, the more likely they will acquire knowledge while reading them and gain better test scores.

In [78]:
plt.figure(figsize=(25,6))

plt.suptitle("Average Score results for number of Books at home separted by Boys and Girls");

plt.subplot(2,2,1);
sns.violinplot(data=pisa, x="Count_Books", y="global_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);

plt.subplot(2,2,2);
sns.violinplot(data=pisa, x="Count_Books", y="math_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);

plt.subplot(2,2,3);
sns.violinplot(data=pisa, x="Count_Books", y="science_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);

plt.subplot(2,2,4);
sns.violinplot(data=pisa, x="Count_Books", y="read_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);

plt.savefig("Avg_scores_hueGender_books.png")
In [79]:
plt.figure(figsize=(8,14))
plt.suptitle("Average Score results for number of Books at home separted by Boys and Girls");

plt.subplot(4,1,1);
sns.pointplot(data=pisa, y="Count_Books", x="global_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Global Score");

plt.subplot(4,1,2);
sns.pointplot(data=pisa, y="Count_Books", x="math_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Math Score");

plt.subplot(4,1,3);
sns.pointplot(data=pisa, y="Count_Books", x="science_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Science Score");

plt.subplot(4,1,4);
sns.pointplot(data=pisa, y="Count_Books", x="read_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Reading Score");

plt.savefig("Avg_scores_hueGender_books_pointplot.png")

  1. Trends: Are truants doing worse in tests than those who never miss school?
In [80]:
axes = []
nrows = 1
ncols = 3

fig = plt.figure(figsize=(30,5))

for idx, col in enumerate(truancy):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    plt.title("Boxplots comparing Global Test Scores with '{}'".format(col.replace("_", " ")))
    sns.boxplot(data=pisa, x=col, y="global_score", hue="Sex", palette=["lightgreen", "lightblue"])
    
plt.savefig("Truancy_Scores_Boxplot.png")

Key findings
Interestingly, students beeing late for school, skipping classes or whole school days appear to linearly achieve worse results than those who never miss school (None). This trend is equal among girls and boys.


  1. Trends: From which countries are the best students coming from?
In [81]:
grouped = pisa.groupby("Country_Code")[["global_score", "math_score", "read_score", "science_score"]]
In [82]:
grouped.agg(["mean", "median", "std"])
Out[82]:
global_score math_score read_score science_score
mean median std mean median std mean median std mean median std
Country_Code
Albania 396.719697 402.147420 88.274916 394.746217 396.17542 85.389452 396.190101 402.35768 103.826415 399.222773 404.45731 89.336969
Argentina 423.856345 424.375733 76.155280 414.185621 413.85730 72.958209 426.301706 429.56686 88.441046 431.081708 432.15218 79.080046
Australia 518.475032 520.304827 89.423959 509.388404 508.73188 91.602060 518.052115 522.26286 90.728018 527.984576 531.08914 94.427579
Austria 513.621297 517.820480 80.648121 518.007550 518.70232 85.515841 503.654768 510.25940 82.312440 519.201573 523.44276 83.436392
Belgium 531.051761 538.807393 87.886315 536.355786 540.82410 93.399068 529.778358 538.54620 90.261947 527.021139 535.75158 89.969356
Brazil 406.449192 401.002993 70.270195 395.924746 387.76288 73.107053 414.288773 411.53600 77.269773 409.134058 404.83032 71.214265
Bulgaria 458.011597 461.101923 90.527378 453.655290 449.64949 85.992280 457.172773 464.67535 104.699859 463.206729 464.13652 91.553873
Canada 520.788302 522.206180 79.314456 518.159228 517.84544 81.844244 520.754237 523.81228 83.460330 523.451442 524.93472 83.496274
Chile 462.550956 462.163280 76.642318 450.953508 447.54634 83.310208 466.286475 468.32920 75.439152 470.412885 468.61246 80.413416
China-Shanghai 589.074847 598.219367 80.916389 614.172761 623.62518 97.036072 571.148461 577.46740 74.933920 581.903320 589.83588 77.095386
Chinese Taipei 539.605731 547.652477 88.942608 564.431281 571.86477 109.839100 527.299739 535.96341 85.094861 527.086172 532.67439 78.085415
Colombia 413.611325 409.631767 68.672365 396.610942 390.33340 71.293416 425.550083 424.16556 74.590061 418.672950 415.27414 70.591873
Connecticut (USA) 525.205968 528.275773 87.620731 514.423539 513.95077 92.487580 531.160372 535.28953 89.276405 530.033993 532.44125 90.029127
Costa Rica 436.260271 434.082020 60.452263 417.221986 411.36468 64.286195 451.199153 450.16660 66.573791 440.359675 437.46737 63.328437
Croatia 488.900815 489.505627 76.423029 477.615311 473.44602 82.461307 491.582223 494.86206 79.089927 497.504912 498.17230 77.876274
Czech Republic 527.512852 530.776300 82.383948 528.062789 530.38635 91.091518 519.884347 522.80984 82.748909 534.591421 538.45580 82.933202
Denmark 497.750004 500.836067 79.699165 499.050807 499.46254 79.675739 497.574954 501.69020 79.689788 496.624252 499.38458 88.543220
Estonia 532.534475 531.749607 72.574632 527.270385 525.71272 76.933228 522.881654 525.75784 75.198247 547.451386 547.03470 74.785152
Finland 527.809193 532.062563 83.418794 518.568040 518.62439 81.953282 523.957911 530.50852 89.149915 540.901627 546.14881 89.027953
Florida (USA) 487.260415 485.785320 78.800091 472.423267 469.31766 80.636668 497.948514 499.83410 80.004661 491.409463 491.27192 84.872872
France 520.458180 526.261267 86.955963 514.580217 516.98862 88.228863 527.204900 536.28242 95.521411 519.589423 526.33345 86.882899
Germany 538.583492 545.459820 82.198689 536.608820 541.21360 88.918862 531.121947 537.36516 81.122723 548.019709 556.35956 85.588681
Greece 475.482481 479.117100 78.555410 462.186916 462.22934 82.018003 487.969051 495.89176 87.361701 476.291477 478.77658 79.696447
Hong Kong-China 556.946489 565.719917 81.418815 564.782139 572.83846 92.343514 547.998901 556.35501 81.751717 558.058426 565.82434 78.876181
Hungary 501.893961 503.603333 80.669928 492.438548 489.80374 86.454807 503.840267 510.98114 81.942371 509.403068 511.41364 81.640266
Iceland 494.918550 498.567940 84.551070 502.479806 502.96774 85.537726 494.219550 500.55586 87.259418 488.056295 491.27192 90.625073
Indonesia 391.070020 388.645273 60.072193 381.732162 377.32516 64.140928 403.588436 404.46666 66.808635 387.889463 385.99402 61.192600
Ireland 525.262177 526.199853 78.614421 510.844918 510.52346 79.161056 532.688588 534.65410 79.638183 532.253026 534.72584 84.639762
Italy 501.402723 504.813380 82.046322 497.356365 497.82676 85.928831 501.547072 508.83900 87.736737 505.304731 508.80270 85.032704
Japan 548.114453 555.020573 86.460625 543.574980 545.45880 88.708330 546.441753 554.59127 90.473314 554.326625 561.48826 88.483229
... ... ... ... ... ... ... ... ... ... ... ... ...
Macao-China 527.336651 531.493887 75.888750 543.606664 546.74404 87.938347 513.588346 518.60902 75.964475 524.814943 529.13092 72.684268
Malaysia 421.865760 422.147937 71.668119 429.681521 426.12556 76.347563 407.687246 411.58005 76.242127 428.228513 427.53637 72.502645
Massachusetts (USA) 526.534603 528.063820 87.927070 518.328411 517.76756 91.184216 530.888527 530.68254 90.157260 530.386871 532.20812 90.859916
Mexico 429.366584 428.348767 64.357857 425.814207 422.42560 68.583896 436.428426 438.00416 71.448414 425.857118 424.97204 64.166725
Montenegro 422.010822 420.148367 74.546958 416.807963 411.83206 75.307562 431.117686 431.39374 82.669146 418.106816 415.18092 75.743661
Netherlands 527.976489 534.309687 82.579793 532.143716 536.11153 85.136701 520.107919 526.54550 83.894040 531.677833 537.05707 86.325851
New Zealand 531.462888 534.665067 90.416052 520.582797 520.02648 92.070187 535.179188 541.16740 93.997975 538.626678 542.18576 94.057496
Norway 509.177669 511.773487 83.552091 500.805060 501.52672 83.113340 518.965665 522.93013 88.190287 507.762282 510.20141 89.031613
Perm(Russian Federation) 495.899854 498.715083 75.801996 497.150527 496.42466 82.228618 498.620514 503.04190 81.921993 491.928522 496.16746 77.189375
Peru 388.507171 386.174387 74.532894 381.083693 375.14412 78.408569 398.740165 399.51078 84.917468 385.697655 383.75608 70.422857
Poland 527.902928 528.987067 80.365261 525.027965 522.67486 87.250333 525.598809 530.14762 82.127991 533.082009 534.25962 81.046789
Portugal 496.321825 499.124257 81.350373 495.695664 496.19098 87.836252 495.350016 499.84947 84.837997 497.919796 499.94406 80.919707
Qatar 398.566920 388.886700 95.076139 390.851725 376.93566 94.608045 405.328154 403.03690 101.910814 399.520881 387.29950 98.446311
Romania 449.425709 446.082703 72.334843 452.958412 447.04003 75.703836 448.099871 447.78788 80.858320 447.218843 444.27452 72.249095
Russian Federation 491.641855 492.299737 77.917981 492.529438 490.85527 82.019136 486.949818 488.99417 84.104338 495.446309 497.61282 79.133036
Serbia 455.964960 454.292770 79.207898 458.271536 452.88206 84.670004 456.190719 455.96772 84.268055 453.432626 451.45467 79.364365
Singapore 557.247591 564.063387 95.310676 574.585067 580.93940 99.931431 543.983636 550.03586 94.237880 553.174071 561.11528 98.362326
Slovak Republic 488.772014 491.235507 90.577965 497.741205 496.07416 93.877594 480.332649 488.43586 94.609615 488.242188 491.17866 92.443571
Slovenia 489.674195 489.134547 83.368172 492.692335 489.45317 85.488782 471.739240 474.22957 88.456378 504.591010 506.09848 86.276429
Spain 505.705798 510.425873 77.811263 503.024471 505.57718 82.773478 502.147558 507.96525 83.476437 511.945364 515.42336 78.852530
Sweden 498.065955 498.718760 84.825225 492.147998 490.93319 83.142317 500.862405 505.23587 93.012866 501.187461 501.29615 87.990670
Switzerland 517.350233 519.813727 80.438310 530.805305 532.02212 86.891021 508.189178 511.94348 81.102981 513.056217 514.77060 82.197446
Thailand 453.962086 447.190913 79.690959 445.239182 432.31810 89.315704 456.828276 456.33512 79.471152 459.818800 453.22640 79.411693
Tunisia 410.012652 411.418113 69.089119 400.479338 395.94172 73.378499 418.773974 425.27758 77.745144 410.784642 411.45094 70.299421
Turkey 474.988330 467.382867 77.849948 462.356431 450.38948 89.309825 487.729666 486.20118 79.650924 474.878892 468.61244 74.784851
United Arab Emirates 451.424692 450.619887 84.070116 444.163179 439.25066 84.651005 451.764182 455.32614 89.009762 458.346714 456.49012 87.750394
United Kingdom 515.292547 516.633207 81.721482 505.232681 504.75932 83.637721 514.252312 516.70266 82.270576 526.392648 527.91870 86.684422
United States of America 502.147153 501.351153 83.687967 490.746849 487.15530 85.151639 508.367090 510.74058 84.470386 507.327520 508.14994 87.946118
Uruguay 438.225009 436.649613 77.878517 434.493752 431.30550 79.738619 437.178559 436.72102 82.220391 443.002715 443.24880 83.409308
Vietnam 520.961851 522.245867 68.807115 516.692616 515.35286 79.945191 513.332775 516.19381 66.161097 532.860162 535.47182 70.729529

67 rows × 12 columns

In [83]:
plt.figure(figsize=(40,20));

plt.subplot(1,4,1)
top_ranking = pisa.groupby("Country_Code")["global_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing overall Test Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="global_score", order=top_ranking, palette="viridis_r");

plt.subplot(1,4,2)
top_ranking = pisa.groupby("Country_Code")["math_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing math_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="math_score", order=top_ranking, palette="mako_r");

plt.subplot(1,4,3)
top_ranking = pisa.groupby("Country_Code")["read_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing read_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="read_score", order=top_ranking, palette="mako_r");

plt.subplot(1,4,4)
top_ranking = pisa.groupby("Country_Code")["science_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing science_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="science_score", order=top_ranking, palette="mako_r");


plt.savefig("Boxplots_Best_Countries.png")

Key findings
Not unexpectedly the top students come from regions in China. My country (Germany) however isn't that bad as always considered by ourselves, always beeing represented in the top 10.

In [84]:
pisa.to_csv("pisa_cleaned_df.csv", index=False)
In [ ]: